The Performance Paradox: Why Correct Code is Slow
AI023 Lesson 6

The Performance Paradox states that a mathematically perfect kernel, such as an element-wise $out = x + y$, can perform worse than a plain CPU loop if it fails to amortize the fixed costs of driving the GPU. This most often manifests as the Launch Tax: a fixed per-launch overhead paid regardless of how little work the kernel does.
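The fixed-cost argument can be made concrete with a toy cost model. All constants below (a ~10 µs launch overhead, illustrative per-element rates for CPU and GPU) are assumptions chosen for illustration, not measurements:

```python
# Toy cost model: total time = fixed launch overhead + per-element work.
# All constants are illustrative assumptions, not measured values.
LAUNCH_OVERHEAD_US = 10.0   # fixed cost paid on every kernel launch
GPU_NS_PER_ELEM = 0.01      # the GPU streams elements very fast...
CPU_NS_PER_ELEM = 1.0       # ...but the CPU pays no launch tax

def gpu_time_us(n):
    return LAUNCH_OVERHEAD_US + n * GPU_NS_PER_ELEM / 1000.0

def cpu_time_us(n):
    return n * CPU_NS_PER_ELEM / 1000.0

for n in (1_000, 10_000, 100_000, 1_000_000):
    winner = "GPU" if gpu_time_us(n) < cpu_time_us(n) else "CPU"
    print(f"n={n:>9,}: gpu={gpu_time_us(n):8.2f} us  "
          f"cpu={cpu_time_us(n):8.2f} us  -> {winner}")
```

Under these assumptions the CPU wins until roughly $N \approx 10{,}000$; only past that point does the fixed launch cost get amortized away.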

1. The "Correctness" Fallacy

Functional correctness is not a proxy for efficiency. Your Triton code might correctly distribute work across thousands of threads, but if the total amount of work ($N$ elements) is small, the GPU remains underutilized: the hardware spends more time on launch and scheduling overhead than on actual arithmetic.
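A back-of-the-envelope occupancy calculation shows how little of the machine a small problem touches. The device figures below (108 SMs with 2048 resident threads each, roughly an NVIDIA A100) are assumed for illustration:

```python
# Assumed device shape, roughly an NVIDIA A100 (illustrative numbers).
SMS = 108
THREADS_PER_SM = 2048
device_threads = SMS * THREADS_PER_SM  # ~221k concurrently resident threads

for n in (1_024, 16_384, 1_048_576):
    # One thread per element, as in a typical element-wise kernel.
    util = min(1.0, n / device_threads)
    print(f"n={n:>9,}: at most {util:6.2%} of resident threads have work")
```

With $N = 1024$, under half a percent of the resident threads have anything to do; the code is correct, and the machine is idle.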

2. The Python Measurement Trap

Benchmarking GPU code from Python using time.time() is dangerous. GPU calls are asynchronous; Python merely enqueues the command and moves on. Without torch.cuda.synchronize(), you measure only the enqueue time, not the execution. With synchronization, you measure the full wall time, and for a tiny kernel the launch overhead alone is often 10x longer than the kernel execution itself.
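The trap can be reproduced without a GPU using an analogy: a worker thread draining a command queue stands in for the device draining its stream. This is a sketch of the *mechanism*, not real CUDA code; `launch` and `synchronize` are hypothetical names playing the roles of a kernel launch and torch.cuda.synchronize():

```python
import queue
import threading
import time

# Analogy for an asynchronous device: a worker thread drains a command
# queue, the way a GPU drains its stream. (Illustrative, not CUDA.)
tasks = queue.Queue()

def worker():
    while True:
        fn = tasks.get()
        fn()
        tasks.task_done()

threading.Thread(target=worker, daemon=True).start()

def launch(fn):
    tasks.put(fn)   # like a kernel launch: returns immediately

def synchronize():
    tasks.join()    # like torch.cuda.synchronize(): wait for the queue

start = time.perf_counter()
launch(lambda: time.sleep(0.05))      # "kernel" that takes 50 ms
queued = time.perf_counter() - start  # tiny: we only timed the enqueue
synchronize()
total = time.perf_counter() - start   # ~50 ms: the real execution time
print(f"without sync: {queued*1e3:.2f} ms, with sync: {total*1e3:.1f} ms")
```

In real PyTorch code, the same role is played by torch.cuda.synchronize() or, better, CUDA events (torch.cuda.Event(enable_timing=True)), which timestamp on the device's own stream.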

[Chart: per-call time (ms) split into Python/driver overhead (the Launch Tax) vs. actual kernel time]

3. Latency vs. Throughput

To overcome the paradox, you must provide enough work to "hide" the launch latency. This is the transition from a latency-bound regime (limited by the CPU-GPU bus) to a throughput-bound regime (limited by GPU memory or compute).
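The regime transition shows up clearly in a toy effective-throughput curve: with a fixed launch cost, delivered bandwidth climbs with $N$ until memory bandwidth, not latency, becomes the limit. The constants (10 µs launch cost, 1000 GB/s peak bandwidth, 12 bytes moved per element for an fp32 $x + y$) are illustrative assumptions:

```python
# Illustrative constants, not measurements.
LAUNCH_US = 10.0        # assumed fixed launch cost
PEAK_GBPS = 1000.0      # assumed peak memory bandwidth (GB/s)
BYTES_PER_ELEM = 12     # read x, read y, write out at 4 bytes each

def effective_gbps(n):
    bytes_moved = n * BYTES_PER_ELEM
    transfer_us = bytes_moved / (PEAK_GBPS * 1e3)  # 1 GB/s = 1e3 bytes/us
    total_us = LAUNCH_US + transfer_us
    return bytes_moved / (total_us * 1e3)          # back to GB/s

for n in (10_000, 1_000_000, 100_000_000):
    frac = effective_gbps(n) / PEAK_GBPS
    print(f"n={n:>11,}: {effective_gbps(n):7.1f} GB/s "
          f"({frac:6.2%} of peak)")
```

At small $N$ the curve is latency-bound (a few percent of peak, dominated by the launch cost); at large $N$ it asymptotically approaches the memory-bandwidth ceiling, the throughput-bound regime.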
